Introduction

Our world is rapidly urbanizing in practically all geographies. In 2018, the UN reported that 55% of the global population lived in urban areas and projected that share to rise to 68% by 2050. With this urbanization, access to green space and parkland is likely to decrease across geographies and across populations. With this general decline in access to nature, there is likely to be an increase in the inequity of access, with the wealthy and privileged losing access at a lesser rate than the poor and underprivileged. Prior work has assessed the public health impacts of this inequity in access to nature in the context of COVID-19 (Spotswood et al., 2021), and work on the “luxury effect” shows that wealthy neighborhoods can, in some cases, support higher biodiversity, due largely to the increased presence of parkland (Magel et al., 2021). In this analysis, I seek to understand the relationship between access to green space and exposure to potentially harmful pollutants. I will also examine the relationship between green space and income. The analysis focuses on the Bay Area of California, a collection of nine counties surrounding San Francisco Bay in Northern California, USA.

Intuitively, it makes sense to hypothesize that, in an urban setting like the Bay Area, there would be a relationship between exposure to pollution and contamination and the relative urbanization of a location. We tend to see natural landscapes as clean and urban hardscapes as dirty, and we make assumptions about which is healthier to live near. Below I quantify that relationship using the spatial distribution of parks in the Bay Area, summary data from the CalEnviroScreen model, and Census Block and BlockGroup geometries.

Methods

In developing this analysis, four key datasets are leveraged: 1) the US Census American Community Survey (ACS) income dataset, 2) US Census TIGER geometries, 3) CalEnviroScreen, and 4) the Trust for Public Land ParkServe parks dataset.

ACS Income

Table B19001 from the US Census Bureau (data.census.gov) is used to determine income levels in each census Block. Income levels in this table are presented as discrete counts of the population within certain income bands. To account for this arrangement of the data, I define the “income” variable for later regressions as the percent of respondents to the survey that make more than $100k annually (USD). The threshold of $100k is chosen for two reasons: a) it is a clean break that is well-defined by the natural arrangement of the income categories, and b) the mean income of the Bay Area is roughly $100k.
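
As an illustration of how the “income” variable can be derived from banded counts, the following minimal Python sketch sums the bands at or above $100k and divides by the table total. The variable IDs (B19001_001E for the total and B19001_014E through B19001_017E for the four bands from $100,000 upward) are assumed from the standard ACS layout and should be checked against the downloaded table. The example counts are hypothetical, and this is not necessarily the code used for the analysis.

```python
# Assumed ACS variable IDs for table B19001 (verify against your download):
# _001E is the table total; _014E.._017E are the bands from $100,000 upward.
TOTAL = "B19001_001E"
OVER_100K_BANDS = ("B19001_014E", "B19001_015E", "B19001_016E", "B19001_017E")


def percent_over_100k(row):
    """Percent of the table total falling in the $100k-and-above bands.

    Returns None when the total is zero or missing, so that such units
    can be dropped later instead of producing a division error.
    """
    total = row.get(TOTAL)
    if not total:
        return None
    over = sum(row.get(band, 0) for band in OVER_100K_BANDS)
    return 100.0 * over / total


# Hypothetical counts for a single census unit (illustration only).
example_row = {
    "B19001_001E": 400,
    "B19001_014E": 60,
    "B19001_015E": 40,
    "B19001_016E": 50,
    "B19001_017E": 50,
}
print(percent_over_100k(example_row))
```

Returning None for an empty unit, rather than 0, keeps units with no data distinct from units where nobody falls in the upper bands.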

Census Geometries

Originally, this analysis was intended to be performed with building footprints provided by the city of San Francisco. However, for practical purposes, census geometries were chosen instead. First, the building footprints dataset is extremely large and detailed, and very long processing times presented a limitation early in the development of this analysis, prompting the switch to a dataset with lower resolution and greater spatial extent. Second, the building footprints included residences, commercial structures, government structures, and retail structures. Given the urban setting, many (possibly most) of the residential structures were multi-family, and treating each footprint as one observation would bias the analysis against people living in apartments, duplexes, etc. Third, because CalEnviroScreen and the ACS income data are presented at the Census BlockGroup and Block levels, respectively, the building footprints would need to be rolled up to those coarser spatial units anyway. Fourth, because of the roll-up requirement and the limited extent of the building footprints (San Francisco only), rolling up to Census geometries would diminish the sample size for later regressions.

Therefore, the decision was made to focus on Census Blocks (for the ACS Income analysis) and Census BlockGroups (for the CalEnviroScreen analysis) and to expand the analysis to the entire Bay Area of California. This allows for a more generalizable analysis and a more meaningful sample set. The Blocks dataset was filtered to eliminate all Blocks with a land area value of 0.0, removing the Blocks in San Francisco that are entirely water. Following this filter, the centroid was calculated for each Block, and this centroid was used to calculate the distance between the Block and the nearest park boundary. For each BlockGroup, the distances were averaged across the Blocks it contains, and this average was assigned as the distance for that BlockGroup.
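
The filtering, centroid, and nearest-boundary steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the GIS workflow actually used: polygons are single rings of projected (planar) coordinates, and the field names `id`, `group`, `land_area`, and `ring` are hypothetical.

```python
import math
from collections import defaultdict


def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon [(x, y), ...] (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(p, ring):
    """Distance from p to the polygon's boundary (not its interior)."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(ring, ring[1:] + ring[:1]))


def block_distances(blocks, parks):
    """Map each Block id to the distance from its centroid to the nearest park boundary."""
    distances = {}
    for block in blocks:
        if block["land_area"] == 0.0:  # skip all-water Blocks
            continue
        centroid = polygon_centroid(block["ring"])
        distances[block["id"]] = min(distance_to_boundary(centroid, park) for park in parks)
    return distances


def group_means(blocks, distances):
    """Average the Block distances within each BlockGroup."""
    sums = defaultdict(lambda: [0.0, 0])
    for block in blocks:
        if block["id"] in distances:
            entry = sums[block["group"]]
            entry[0] += distances[block["id"]]
            entry[1] += 1
    return {group: total / n for group, (total, n) in sums.items()}


# Hypothetical toy geometries (units are arbitrary projected units).
parks = [[(0, 0), (10, 0), (10, 10), (0, 10)]]
blocks = [
    {"id": "A", "group": "G1", "land_area": 100.0, "ring": [(20, 0), (30, 0), (30, 10), (20, 10)]},
    {"id": "B", "group": "G1", "land_area": 100.0, "ring": [(40, 0), (50, 0), (50, 10), (40, 10)]},
    {"id": "W", "group": "G1", "land_area": 0.0, "ring": [(60, 0), (70, 0), (70, 10), (60, 10)]},
]
dist = block_distances(blocks, parks)
print(dist, group_means(blocks, dist))
```

Because the distance is measured to the park boundary, a centroid that falls inside a park still receives a positive distance in this sketch. Whether that matches the behavior of the GIS tool used in the analysis is an open assumption.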

CalEnviroScreen

CalEnviroScreen 4.0 is used as a simplified index of general environmental contaminant exposure. While the index incorporates a variety of stressors and pollutants, the summary index was chosen as the most appropriate indicator, given the wide variety of communities and potential exposures throughout the sample area. The CalEnviroScreen 4.0 data is presented at the Census BlockGroup level, and therefore the summary value is compared against the mean distance to a park from the Block centroids contained within each BlockGroup.

Trust for Public Land ParkServe Dataset

Park boundaries were downloaded as shapefiles provided by the Trust for Public Land ParkServe program. No modifications were made to the park boundaries dataset (with the exception of re-projection), and all data were retained for the nine counties encompassing the Bay Area. Importantly, data were not included for the counties bordering the Bay Area. It is therefore possible that some Blocks on the outer boundary of the Bay Area counties were assigned a “nearest park” that is farther away than their actual nearest park, if that park lies in a county that was not included. This is assumed to affect a negligible portion of the sample, given its size.

General Workflow

  1. Load park geography/geometry shapefiles and clip to Bay Area Counties
  2. Load Census Blocks and BlockGroups for the Bay Area
  3. Convert Blocks to Centroids
  4. Calculate linear distance from each Block centroid to the nearest park boundary and attribute Blocks as such
  5. Within each BlockGroup, average the linear distances of the Blocks it contains
  6. Assign the average distance to the park boundary as an independent variable for later regression analyses
  7. Calculate the percentage of the population with an income over $100k, and attribute BlockGroups with CalEnviroScreen summary data
  8. Regression - Regress Income (Block level) and CalEnviroScreen score (BlockGroup level) on distance to the nearest park boundary
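
The final regression step can be illustrated with a closed-form simple linear regression in plain Python. This is only a sketch of the technique: the function name `simple_ols` and the toy data are hypothetical. Rows with a missing value are dropped before fitting, which is one reasonable way to handle missingness.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Pairs with a missing (None) value are dropped before fitting.
    Returns the estimates, the slope's standard error and t value,
    R-squared, and the residual degrees of freedom.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None and yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    syy = sum((yi - mean_y) ** 2 for _, yi in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = syy - slope * sxy          # residual sum of squares
    df = n - 2                       # residual degrees of freedom
    se_slope = math.sqrt(sse / df / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
        "r_squared": 1.0 - sse / syy,
        "df": df,
    }


# Hypothetical toy data: distance to park (x) and an outcome (y).
fit = simple_ols([1, 2, 3, 4, 5, None], [2, 4, 5, 4, 5, 3])
print(fit)
```

A p-value for the slope would additionally require the t distribution with `df` degrees of freedom, which is omitted here to keep the sketch dependency-free.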

Results

The results of the regression analyses are presented below in Figures 2-5 and in the accompanying model summaries. Neither regression shows a relationship that is significant at the 0.05 level between distance to the nearest park boundary and either the CalEnviroScreen summary score or the percentage of ACS respondents with an income over $100k annually. In both cases, a linear model was fit to the data, and it appears to be the most appropriate choice based on visual analysis of the scatterplots and of the density distributions of the residuals, which were generally normal and centered roughly about 0. However, R-squared values are near zero in both models (about 0.00004 for income and 0.002 for CalEnviroScreen), and both p-values exceed 0.05 (0.693 and 0.0645, respectively), indicating a lack of both predictive power and statistical significance. The CalEnviroScreen relationship is marginal (significant only at the 0.1 level), but it explains a negligible share of the variance.

I reason that this lack of predictive capacity can be explained by a visual analysis of Figure 1, which maps all parkland in the Bay Area. As the map shows, the Bay Area has a very high density of parks, some small and typical of urban settings, and others extremely large and unique to the geography and topography of the area. Because parks are so ubiquitous throughout the nine counties analyzed, linear distance to the nearest park can reasonably be expected to be relatively uniform across the entire region. A predictor that varies so little leaves little variation with which to explain differences in income or CalEnviroScreen score.

Fig. 1

A map of all Park Lands in the Bay Area, California, according to the Trust for Public Land ParkServe Data Program

Fig. 2

Scatterplot and Line of Best Fit for the linear regression between the percentage of respondents in a census Block with an income over $100k annually and the linear distance between the centroid of that Block and the nearest Park boundary

Fig. 3

Density plot of the residuals for the income linear model. Residuals are roughly normally distributed, suggesting that a linear regression is likely an appropriate model to understand the relationship

## 
## Call:
## lm(formula = Perc100k ~ Length, data = Blocks_100k_analysis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.22 -15.10   1.96  15.91  48.17 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.2834595  0.7471319  69.979   <2e-16 ***
## Length      -0.0002559  0.0006484  -0.395    0.693    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.4 on 3929 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  3.965e-05,  Adjusted R-squared:  -0.0002149 
## F-statistic: 0.1558 on 1 and 3929 DF,  p-value: 0.6931

Fig. 4

Scatterplot and Line of Best Fit for the linear regression between the CalEnviroScreen 4.0 summary score of a BlockGroup and the mean linear distance from the centroids of the Blocks contained within that BlockGroup to the nearest Park boundary

Fig. 5

Density plot of the residuals for the CalEnviroScreen linear model. Residuals are roughly normally distributed, suggesting that a linear regression is likely an appropriate model to understand the relationship

## 
## Call:
## lm(formula = Score ~ Length, data = CES4_Bay_Tracts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.050  -9.185  -2.688   7.295  46.752 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.0993755  0.8443874   23.80   <2e-16 ***
## Length      -0.0014341  0.0007751   -1.85   0.0645 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.03 on 1556 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.002195,   Adjusted R-squared:  0.001554 
## F-statistic: 3.423 on 1 and 1556 DF,  p-value: 0.06449
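
As a consistency check on the two model summaries above, the reported quantities can be cross-checked against one another using standard identities for a regression with a single predictor: the t value is the estimate divided by its standard error, the F-statistic equals the squared t value, and R-squared equals F / (F + residual degrees of freedom). The short Python sketch below applies these identities to the printed (rounded) values. The helper name `check_summary` is hypothetical, and small discrepancies from the printed figures are expected because the inputs are rounded.

```python
def check_summary(estimate, std_error, df):
    """Recompute t, F, and R-squared for a one-predictor linear model."""
    t_value = estimate / std_error
    f_stat = t_value ** 2            # valid only with a single predictor
    r_squared = f_stat / (f_stat + df)
    return t_value, f_stat, r_squared


# Slope estimates, standard errors, and residual df copied from the summaries.
income_t, income_f, income_r2 = check_summary(-0.0002559, 0.0006484, 3929)
ces_t, ces_f, ces_r2 = check_summary(-0.0014341, 0.0007751, 1556)

print(f"Income model: t={income_t:.3f}, F={income_f:.4f}, R2={income_r2:.3e}")
print(f"CES model:    t={ces_t:.3f}, F={ces_f:.3f}, R2={ces_r2:.6f}")
```

The recomputed values agree with the printed t values, F-statistics, and R-squared values to within rounding, which indicates that the two summaries are internally consistent.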